perf: use aligned pointer reads for SparkUnsafeRow field accessors (#46)
martin-augment wants to merge 4 commits into main
Conversation
SparkUnsafeRow field offsets are always 8-byte aligned (the JVM guarantees 8-byte alignment on the base address, bitset_width is a multiple of 8, and each field slot is 8 bytes). This means we can safely use ptr::read() instead of the from_le_bytes(slice) pattern for all typed accesses, avoiding slice creation and try_into overhead.

Move primitive accessor implementations out of the SparkUnsafeObject trait defaults and into each concrete impl via a macro parameterized on the read method:

- SparkUnsafeRow uses ptr::read() (aligned)
- SparkUnsafeArray uses ptr::read_unaligned() (may be unaligned when nested in a row's variable-length region)

Also switch is_null_at/set_not_null_at in SparkUnsafeRow from read_unaligned/write_unaligned to aligned read/write, since the null bitset is always at 8-byte-aligned offsets within the row.
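The alignment argument above can be checked with a small standalone sketch (the `field_addr` helper is illustrative, not the crate's actual API):

```rust
// Hypothetical sketch of the offset arithmetic described above: if the base
// address is 8-byte aligned, the bitset width is a multiple of 8, and each
// field slot is 8 bytes wide, every field address stays 8-byte aligned.
fn field_addr(base: usize, bitset_width: usize, index: usize) -> usize {
    base + bitset_width + index * 8
}

fn main() {
    let base = 0x1000; // assume a JVM-provided, 8-byte-aligned base address
    for bitset_width in [8usize, 16, 24] {
        for index in 0..4 {
            let addr = field_addr(base, bitset_width, index);
            assert_eq!(addr % 8, 0, "field address must stay 8-byte aligned");
        }
    }
    println!("all field addresses aligned");
}
```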
The test_append_null_struct_field_to_struct_builder test used a plain [u8; 16] stack buffer with no alignment guarantee. Since is_null_at performs aligned i64 reads, Miri flags this as undefined behavior when the buffer lands at a non-8-byte-aligned address. Wrap the buffer in a #[repr(align(8))] struct to match the alignment that real Spark UnsafeRow data always has from JVM memory.
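A minimal sketch of that test-buffer fix (the wrapper name is illustrative):

```rust
// Wrap a byte buffer in a #[repr(align(8))] struct so its start address is
// guaranteed to be 8-byte aligned, matching real JVM-allocated UnsafeRow data.
#[repr(align(8))]
struct AlignedBuf16([u8; 16]);

fn main() {
    let buf = AlignedBuf16([0u8; 16]);
    let addr = buf.0.as_ptr() as usize;
    // An aligned i64 read at this address is now well-defined.
    assert_eq!(addr % 8, 0);
    println!("buffer aligned at {addr:#x}");
}
```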
Walkthrough: The changes modify Spark unsafe object accessors by introducing macro-based primitive getter implementations.
Summary of Changes (Gemini Code Assist): This pull request refactors how primitive fields are accessed within Spark UnsafeRow and UnsafeArray structures in Rust, improving performance and correctness by explicitly handling memory alignment. It introduces a macro to generate accessor methods that use either aligned or unaligned pointer reads.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a performance optimization by using aligned pointer reads for SparkUnsafeRow field accessors, and unaligned reads for SparkUnsafeArray. A macro is used to generate the accessor methods, which is a clean approach to handle the different alignment guarantees. While the logic for choosing between aligned and unaligned reads is sound, the implementation of the primitive accessors overlooks data endianness. This introduces a critical correctness bug that can lead to silent data corruption on big-endian architectures. I have provided a detailed comment and a code suggestion to fix this issue.
macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const f32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const f64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }
    };
}
The current implementation of primitive accessors has a correctness issue on big-endian systems. ptr::read() and ptr::read_unaligned() interpret bytes using the native endianness of the CPU. However, Spark's Unsafe format is little-endian. This will lead to incorrect data being read on big-endian architectures, causing silent data corruption.
To ensure portability and correctness, you should explicitly handle the little-endian format. For integer types, you can use T::from_le(). For floating-point types, you should read the data as an integer of the same size, convert its endianness, and then convert the bits to a float (e.g., f32::from_bits(u32::from_le(...))).
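The suggested pattern can be exercised on its own; on a little-endian target `from_le` compiles to a no-op, so the portability fix costs nothing there (standalone sketch, not the PR's code):

```rust
// Standalone sketch of the endianness-safe pattern from the review: read the
// raw bits with a pointer read, then normalize from Spark's little-endian
// layout to the native representation.
fn read_i32_le(bytes: &[u8; 4]) -> i32 {
    let raw = unsafe { (bytes.as_ptr() as *const i32).read_unaligned() };
    i32::from_le(raw) // no-op on little-endian targets, byte swap on big-endian
}

fn read_f32_le(bytes: &[u8; 4]) -> f32 {
    let raw = unsafe { (bytes.as_ptr() as *const u32).read_unaligned() };
    f32::from_bits(u32::from_le(raw)) // convert bits first, then transmute
}

fn main() {
    assert_eq!(read_i32_le(&[0x04, 0x03, 0x02, 0x01]), 0x0102_0304);
    assert_eq!(read_f32_le(&1.5f32.to_le_bytes()), 1.5);
    println!("endianness-safe reads ok");
}
```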
macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i16::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const u32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u32, convert, then transmute.
            unsafe { f32::from_bits(u32::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const u64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u64, convert, then transmute.
            unsafe { f64::from_bits(u64::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }
    };
}
Cursor Bugbot has reviewed your changes and found 1 potential issue.
/// Returns double value at the given index of the object.
#[inline]
fn get_double(&self, index: usize) -> f64 {
    let addr = self.get_element_offset(index, 8);
Aligned read on potentially unaligned nested struct from array
High Severity
The trait method get_struct returns a SparkUnsafeRow when called on a SparkUnsafeArray. Since SparkUnsafeRow uses aligned ptr::read() for all typed accesses, but the PR's own documentation states that SparkUnsafeArray base addresses "may be unaligned when nested within a row's variable-length region," the resulting SparkUnsafeRow from array.get_struct(idx, ...) would inherit that misalignment. Calling is_null_at, get_int, get_long, etc. on such a row invokes aligned ptr::read() on potentially unaligned memory, which is undefined behavior. This path is exercised in list.rs when arrays contain struct elements.
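The hazard can be reproduced in miniature: a nested base at an odd offset inside an aligned buffer makes aligned `i64` reads UB, while `read_unaligned` stays well-defined (standalone sketch, not the crate's API):

```rust
// Miniature reproduction of the hazard: a "nested row" base at offset 4 inside
// an 8-byte-aligned buffer is only 4-byte aligned, so an aligned ptr::read()
// of an i64 there would be UB; read_unaligned() stays well-defined.
#[repr(align(8))]
struct Buf([u8; 24]);

fn main() {
    let mut buf = Buf([0u8; 24]);
    // Write a value into a variable-length region starting at offset 4.
    buf.0[4..12].copy_from_slice(&42i64.to_le_bytes());

    let nested_base = unsafe { buf.0.as_ptr().add(4) };
    // Misaligned for i64: ptr::read() here would be undefined behavior.
    assert_ne!(nested_base as usize % 8, 0);
    // read_unaligned() is the sound choice for such a base.
    let v = unsafe { (nested_base as *const i64).read_unaligned() };
    assert_eq!(i64::from_le(v), 42);
    println!("unaligned nested read ok");
}
```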
Code Review: perf: use aligned pointer reads for SparkUnsafeRow field accessors

Overall this is a well-motivated and well-documented performance optimization. The alignment reasoning is sound and the macro approach cleanly handles the two different cases.

Correctness: Endianness assumption

This is the most important issue. The old code used Spark's little-endian from_le_bytes decoding, whereas ptr::read() uses native endianness. If Comet only targets little-endian architectures this is fine, but it should be documented. Consider a module-level comment noting the LE assumption, or a compile-time target_endian check.

Minor: Single-byte reads have no alignment requirement, so ignoring $read_method in get_boolean/get_byte is correct.

Minor: Alignment claim wording

The comment "JVM guarantee" for 8-byte base-address alignment is accurate in practice for HotSpot on x86_64, but the Java Language Specification does not formally require it.

Test fix is correct

The #[repr(align(8))] wrapper restores the 8-byte alignment invariant that is_null_at's aligned reads rely on.

Summary

The endianness point is the one I would ask to be addressed before merging, even if only as a module-level comment or a compile-time guard.
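If little-endian-only support is the intended contract, the compile-time check the reviewer suggests is a one-liner:

```rust
// Fail the build on big-endian targets rather than silently corrupting data.
#[cfg(target_endian = "big")]
compile_error!("this crate assumes the little-endian Spark Unsafe layout");

fn main() {
    // On little-endian targets this compiles, and from_le is the identity.
    assert_eq!(i32::from_le(7), 7);
    println!("little-endian target confirmed");
}
```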
🤖 Augment PR Summary

Summary: Improves Spark UnsafeRow/UnsafeArray primitive field access by switching from byte-slice decoding to typed pointer reads, using aligned reads where layout guarantees alignment.

Changes:
Technical Notes: Adds alignment rationale to docs/comments to justify the aligned vs unaligned access strategy.
let addr = self.get_element_offset(index, 4) as *const i32;
debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
// SAFETY: addr points to valid element data (4 bytes) within the row/array region.
unsafe { addr.$read_method() }
ptr::$read_method() reads in native endianness, but Spark UnsafeRow/UnsafeArray encodes primitives as little-endian (e.g., SparkUnsafeArray::new uses from_le_bytes). If this crate ever targets big-endian platforms, these accessors would decode incorrectly unless you explicitly guard/document little-endian-only support.
Severity: medium
  let mask: i64 = 1i64 << (index & 0x3f);
  let word_offset = (self.row_addr + (((index >> 6) as i64) << 3)) as *const i64;
- let word: i64 = word_offset.read_unaligned();
+ let word: i64 = word_offset.read();
Switching to word_offset.read() assumes row_addr is 8-byte aligned, but point_to_slice can point at arbitrary &[u8] (and Vec<u8> is not guaranteed 8-byte aligned by its type). If an unaligned slice is ever passed in, this becomes UB; it may be worth enforcing/documenting the alignment precondition at the API boundary.
Severity: high
Other Locations
native/core/src/execution/shuffle/spark_unsafe/row.rs:369
native/core/src/execution/shuffle/spark_unsafe/row.rs:370
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@native/core/src/execution/shuffle/spark_unsafe/row.rs`:
- Around line 280-283: The safe constructor
SparkUnsafeRow::point_to_slice(&[u8]) currently accepts unaligned slices but
downstream helpers generated by impl_primitive_accessors!(read) and methods
is_null_at() / set_not_null_at() use aligned ptr::read()/write(), so add a
runtime alignment check in point_to_slice that asserts the row base pointer is
8-byte aligned (e.g., check (row_addr as usize) % 8 == 0 or use
is_aligned_to(8)) and panic with a clear message if not; this enforces the
8-byte alignment invariant at the safe entry point and prevents undefined
behavior when benchmarks pass unaligned Vec<u8> buffers.
📒 Files selected for processing (2)
native/core/src/execution/shuffle/spark_unsafe/list.rs
native/core/src/execution/shuffle/spark_unsafe/row.rs
// SparkUnsafeRow field offsets are always 8-byte aligned: the base address is 8-byte
// aligned (JVM guarantee), bitset_width is a multiple of 8, and each field slot is
// 8 bytes. This means we can safely use aligned ptr::read() for all typed accesses.
impl_primitive_accessors!(read);
Enforce the 8-byte row alignment invariant at the safe entry point.
impl_primitive_accessors!(read) and the aligned ptr::read()/ptr::write() calls in is_null_at() and set_not_null_at() require row_addr to be 8-byte aligned, but SparkUnsafeRow::point_to_slice(&[u8]) accepts unaligned slices. The benchmark passes Vec<u8> buffers (alignment = 1) to point_to_slice(), which later causes undefined behavior when aligned pointer operations dereference misaligned addresses.
Add an alignment assertion at the safe boundary to catch this at runtime:
Suggested guard
 pub fn point_to_slice(&mut self, slice: &[u8]) {
+    assert!(
+        slice.is_empty()
+            || (slice.as_ptr() as usize) % std::mem::align_of::<i64>() == 0,
+        "SparkUnsafeRow::point_to_slice requires an 8-byte aligned buffer"
+    );
     self.row_addr = slice.as_ptr() as i64;
     self.row_size = slice.len() as i32;
 }

The test at line 1695 is safe because it uses #[repr(align(8))], but the benchmark at line 236 passes unaligned Vec<u8> slices and will panic or exhibit undefined behavior without this guard.
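For the benchmark side, one way to satisfy such an assertion is to back each row with `u64` storage, which Rust guarantees is 8-byte aligned (a sketch under that assumption; `aligned_row_buf` is an illustrative helper, not the benchmark's actual code):

```rust
// Allocate row storage as Vec<u64> (8-byte aligned by the type's layout
// guarantee) and view it as bytes, instead of using a Vec<u8> whose start
// address may land anywhere.
fn aligned_row_buf(len_bytes: usize) -> (Vec<u64>, *const u8) {
    let words = len_bytes.div_ceil(8);
    let storage = vec![0u64; words];
    let ptr = storage.as_ptr() as *const u8;
    (storage, ptr) // keep storage alive for as long as ptr is used
}

fn main() {
    let (storage, ptr) = aligned_row_buf(100);
    assert_eq!(ptr as usize % 8, 0, "row base must be 8-byte aligned");
    assert!(storage.len() * 8 >= 100);
    println!("aligned benchmark buffer ready");
}
```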


3670: To review by AI